In this project, we will predict Boston-area house prices using machine learning algorithms.
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, accuracy_score, r2_score, mean_squared_error, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.svm import SVC
from scipy.stats import expon, randint
EDA stands for Exploratory Data Analysis, which is the process of analyzing and summarizing data sets to understand their main characteristics. EDA is a preliminary step in data analysis that helps to identify patterns, relationships, and anomalies in the data before applying any statistical or machine learning techniques.
First, let's read our dataset file.
file_path = os.path.join(os.getcwd(), 'DataSet.xlsx')
boston_house = pd.read_excel(file_path)
Now, we can use the shape attribute to check how large the resulting DataFrame is:
boston_house.shape
(506, 14)
We can examine the contents of the boston_house DataFrame using the head() method, which returns the first five rows:
boston_house.head()
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | NaN | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
boston_house
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | NaN | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
| 505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |
506 rows × 14 columns
Now, let's generate a high-level summary of the attributes in the DataFrame using the describe and info methods.
boston_house.describe(include='all')
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.000000 | 506.000000 | 506.000000 | 480.000000 | 506.000000 | 506.000000 | 506.000000 | 479.000000 | 506.000000 | 506.000000 | 506.000000 | 486.000000 | 506.000000 | 452.000000 |
| mean | 1.269195 | 13.295257 | 9.205158 | 0.175000 | 1.101175 | 15.679800 | 58.744660 | 6.211663 | 78.063241 | 339.317787 | 42.614980 | 336.820947 | 11.537806 | 23.750442 |
| std | 2.399207 | 23.048697 | 7.169630 | 0.380364 | 1.646991 | 27.220206 | 33.104049 | 6.527286 | 203.542157 | 180.670077 | 87.585243 | 121.174519 | 6.064932 | 8.808602 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.385000 | 3.561000 | 1.137000 | 1.129600 | 1.000000 | 20.200000 | 2.600000 | 0.320000 | 1.730000 | 6.300000 |
| 25% | 0.049443 | 0.000000 | 3.440000 | 0.000000 | 0.449000 | 5.961500 | 32.000000 | 2.425900 | 4.000000 | 254.000000 | 17.000000 | 370.415000 | 6.877500 | 18.500000 |
| 50% | 0.144655 | 0.000000 | 6.960000 | 0.000000 | 0.538000 | 6.322500 | 65.250000 | 3.917500 | 5.000000 | 307.000000 | 18.900000 | 390.885000 | 10.380000 | 21.950000 |
| 75% | 0.819623 | 18.100000 | 18.100000 | 0.000000 | 0.647000 | 6.949000 | 89.975000 | 6.341400 | 24.000000 | 403.000000 | 20.200000 | 395.630000 | 15.015000 | 26.600000 |
| max | 9.966540 | 100.000000 | 27.740000 | 1.000000 | 7.313000 | 100.000000 | 100.000000 | 24.000000 | 666.000000 | 711.000000 | 396.900000 | 396.900000 | 34.410000 | 50.000000 |
Notice that the RAD and CHAS columns are categorical. Let's describe them separately.
boston_house['RAD'].astype('category').describe()
count 506 unique 10 top 5 freq 115 Name: RAD, dtype: int64
boston_house['CHAS'].astype('category').describe()
count 480.0 unique 2.0 top 0.0 freq 396.0 Name: CHAS, dtype: float64
Now, let's use info() to describe our dataset's characteristics.
boston_house.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 506 entries, 0 to 505 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CRIM 506 non-null float64 1 ZN 506 non-null float64 2 INDUS 506 non-null float64 3 CHAS 480 non-null float64 4 NOX 506 non-null float64 5 RM 506 non-null float64 6 AGE 506 non-null float64 7 DIS 479 non-null float64 8 RAD 506 non-null int64 9 TAX 506 non-null float64 10 PTRATIO 506 non-null float64 11 B 486 non-null float64 12 LSTAT 506 non-null float64 13 MEDV 452 non-null float64 dtypes: float64(13), int64(1) memory usage: 55.5 KB
Now, let's count null values and their proportion
nan_cols = boston_house.isnull()
count_nan_per_column = nan_cols.sum()
count_nan_per_column
CRIM 0 ZN 0 INDUS 0 CHAS 26 NOX 0 RM 0 AGE 0 DIS 27 RAD 0 TAX 0 PTRATIO 0 B 20 LSTAT 0 MEDV 54 dtype: int64
proportion_nan_per_column = nan_cols.mean()
proportion_nan_per_column
CRIM 0.000000 ZN 0.000000 INDUS 0.000000 CHAS 0.051383 NOX 0.000000 RM 0.000000 AGE 0.000000 DIS 0.053360 RAD 0.000000 TAX 0.000000 PTRATIO 0.000000 B 0.039526 LSTAT 0.000000 MEDV 0.106719 dtype: float64
Now let's plot the number of unique values per attribute.
unique_counts = boston_house.nunique()
print("Unique values per column:\n", unique_counts)
plt.figure(figsize=(10, 6))
unique_counts.plot(kind='bar', color='skyblue')
plt.title('Number of Unique Values per Column')
plt.xlabel('Columns')
plt.ylabel('Count of Unique Values')
plt.xticks(rotation=0)
plt.show()
Unique values per column: CRIM 452 ZN 27 INDUS 77 CHAS 2 NOX 132 RM 437 AGE 399 DIS 343 RAD 10 TAX 67 PTRATIO 85 B 357 LSTAT 445 MEDV 210 dtype: int64
for attribute in boston_house.columns:
plt.figure(figsize=(20, 10))
boston_house[attribute].value_counts().plot(kind='bar', color='skyblue')
plt.title(f'Values count for {attribute}')
plt.xlabel(attribute)
plt.ylabel('Frequency')
plt.show()
The number of unique values in a column can significantly impact its value in machine learning, depending on the type of model you are building and the nature of your data. Here are some general guidelines:
The value of a column in machine learning is context-dependent. By analyzing the number of unique values and combining this with domain knowledge and statistical measures, you can make informed decisions about which columns to include in your model.
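As a sketch of how such a screening might look in code (the toy column names here are hypothetical): a column with a single unique value carries no signal, while a column with one distinct value per row behaves like an ID and is rarely predictive.

```python
import pandas as pd

# Hypothetical toy frame; 'id' and 'const' stand in for low-value columns.
df = pd.DataFrame({
    'id':    [1, 2, 3, 4, 5],          # unique per row -> ID-like
    'const': [7, 7, 7, 7, 7],          # constant -> carries no information
    'feat':  [0.1, 0.4, 0.4, 0.9, 0.2]
})

unique_counts = df.nunique()
constant_cols = unique_counts[unique_counts == 1].index.tolist()
id_like_cols = unique_counts[unique_counts == len(df)].index.tolist()
print(constant_cols, id_like_cols)  # -> ['const'] ['id']
```

Columns flagged this way are candidates to drop, subject to the domain-knowledge check described above.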
Correlation is a statistical measure that expresses the extent to which two variables are linearly related; it refers to the degree to which the variables change together, or co-vary.
Knowing the correlation helps uncover important relationships between elements we are investigating. It provides insight into how changes in one variable may correlate with or predict changes in another. As researchers we rely on correlation to better understand the links between different phenomena.
The correlation coefficient quantifies the strength and direction of the correlation. Values closer to 1 or -1 represent stronger correlations, while those closer to 0 indicate little connection between the variables.
Why is correlation important for machine learning?
correlation_matrix = boston_house.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Heatmap')
plt.show()
So RM, LSTAT, INDUS, and PTRATIO have the strongest correlations with MEDV (the target variable).
A scatter plot displays individual data points based on their values for two variables, making it easy to observe relationships, clusters, and outliers.
for attribute in boston_house.columns:
if attribute == 'MEDV': continue
plt.figure(figsize=(10, 6))
plt.scatter(boston_house[attribute], boston_house['MEDV'],
alpha=0.6, edgecolors='w', s=100)
plt.title(f'Scatter Plot of {attribute} vs MEDV')
plt.xlabel(attribute)
plt.ylabel('MEDV')
plt.grid(True)
plt.show()
A hexbin plot is useful for visualizing the density of data points when dealing with large datasets. It bins data points into hexagonal bins and uses color to represent the number of points in each bin.
for attribute in boston_house.columns:
if attribute == 'MEDV': continue
plt.figure(figsize=(10, 6))
hb = plt.hexbin(boston_house[attribute], boston_house['MEDV'],
gridsize=30, cmap='viridis')
cb = plt.colorbar(hb, label='Count')
plt.title(f'Hexbin Plot of {attribute} vs MEDV')
plt.xlabel(attribute)
plt.ylabel('MEDV')
plt.show()
Variable analysis can be either univariate, bivariate, or multivariate. Each of them provides insights into the distribution and correlations between the dataset's variables. Techniques vary depending on the number of variables analyzed:
The main focus in univariate analysis is on examining each variable within our dataset on its own. During this analysis, we can uncover insights such as the median, mode, maximum, range, and outliers.
This type of analysis is applicable to both categorical and numerical variables.
Bivariate analysis aims to reveal insights between two chosen variables and focuses on understanding the distribution and relationship between these two variables.
As we analyze two variables at the same time, this type of analysis can be trickier. It can encompass three different pairs of variables: numerical-numerical, numerical-categorical, and categorical-categorical.
sns.pairplot(boston_house, hue='MEDV', diag_kind='kde', palette='bright')
<seaborn.axisgrid.PairGrid at 0x7a90c94a6380>
A frequent challenge with large datasets is the simultaneous analysis of multiple variables. Even though univariate and bivariate analysis methods offer valuable insights, this is usually not enough for analyzing datasets containing multiple variables (usually more than five).
This issue of managing high-dimensional data, usually referred to as the curse of dimensionality, is well-documented. Having a large number of variables can be advantageous as it allows the extraction of more insights. At the same time, this advantage can be against us due to the limited number of techniques available for analyzing or visualizing multiple variables concurrently.
Missing values are a common issue in machine learning. They occur when a particular variable lacks data points, resulting in incomplete information and potentially harming the accuracy and dependability of your models. It is essential to address missing values effectively to ensure robust and unbiased results in your machine-learning projects.
boston_house.isnull().sum()
CRIM 0 ZN 0 INDUS 0 CHAS 26 NOX 0 RM 0 AGE 0 DIS 27 RAD 0 TAX 0 PTRATIO 0 B 20 LSTAT 0 MEDV 54 dtype: int64
boston_house_cleaned = boston_house.copy()
Because MEDV is our target variable and our problem is regression, rows with a missing MEDV value provide no training signal, so let's remove the rows that have NaN MEDV.
boston_house_cleaned.dropna(subset=['MEDV'], inplace=True)
boston_house_cleaned.reset_index(drop=True, inplace=True)
boston_house_cleaned.isnull().sum()
CRIM 0 ZN 0 INDUS 0 CHAS 24 NOX 0 RM 0 AGE 0 DIS 25 RAD 0 TAX 0 PTRATIO 0 B 0 LSTAT 0 MEDV 0 dtype: int64
boston_house_cleaned.shape
(452, 14)
Here are some common imputation methods:
Now, to handle the missing values in column B, we use the mean as our imputation method (assigning the result back avoids pandas' chained-assignment pitfall with `inplace=True` on a column slice):
boston_house_cleaned['B'] = boston_house_cleaned['B'].fillna(boston_house_cleaned['B'].mean())
boston_house_cleaned.isnull().sum()
CRIM 0 ZN 0 INDUS 0 CHAS 24 NOX 0 RM 0 AGE 0 DIS 25 RAD 0 TAX 0 PTRATIO 0 B 0 LSTAT 0 MEDV 0 dtype: int64
These fill methods are particularly useful when there is a logical sequence or order in the data, and missing values can reasonably be assumed to follow a pattern. The method parameter in fillna() lets you specify the filling strategy; here, it's set to 'ffill' for forward fill and 'bfill' for backward fill.
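A minimal sketch of both fills on a toy Series (recent pandas versions prefer the equivalent `ffill()`/`bfill()` methods over `fillna(method=...)`):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()   # propagate the last valid value forward
backward = s.bfill()  # pull the next valid value backward

print(forward.tolist())   # [1.0, 1.0, 1.0, 4.0, 4.0]
print(backward.tolist())  # [1.0, 4.0, 4.0, 4.0, nan] - no later value to pull back
```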
Linear interpolation assumes a straight line between two adjacent non-missing values.
Quadratic interpolation assumes a quadratic curve that passes through three adjacent non-missing values.
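A small sketch of both interpolation modes on a toy Series (the quadratic method requires SciPy, which this notebook already imports):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 9.0])

# Linear: straight line between the neighboring non-missing values.
linear = s.interpolate(method='linear')
print(linear.tolist())  # [1.0, 2.0, 3.0, 5.0, 7.0, 9.0]

# Quadratic: fits a second-order spline through the known points.
quadratic = s.interpolate(method='quadratic')
```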
Prediction is a statistical method that involves using the values of other variables to predict the missing values. This method can be used when there is a relationship between the variables in the data
Let's use KNN to predict missing values for DIS attribute.
boston_house_features = boston_house_cleaned.loc[:, boston_house_cleaned.columns != 'MEDV']
imputer = KNNImputer(n_neighbors=5)
imputed = imputer.fit_transform(boston_house_features)
imputed = pd.DataFrame(imputed, columns=boston_house_features.columns)
boston_house_cleaned.loc[:, 'DIS'] = imputed['DIS']
boston_house_cleaned.isnull().sum()
CRIM 0 ZN 0 INDUS 0 CHAS 24 NOX 0 RM 0 AGE 0 DIS 0 RAD 0 TAX 0 PTRATIO 0 B 0 LSTAT 0 MEDV 0 dtype: int64
Now, let's impute CHAS column using random forest classifier
X = boston_house_cleaned.dropna().drop("CHAS", axis=1).values
Y = boston_house_cleaned.dropna()["CHAS"].values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=0)
forest = RandomForestClassifier(n_estimators=45, max_depth=25, random_state=0,
                                max_features=0.6, min_samples_leaf=3, n_jobs=-1)
forest.fit(X_train, Y_train)
y_pred_train = forest.predict(X_train)
y_pred = forest.predict(X_test)
y_pred_proba = forest.predict_proba(X_test)
accuracy_train = accuracy_score(Y_train, y_pred_train)
accuracy_test = accuracy_score(Y_test, y_pred)
print("ACCURACY: TRAIN=%.4f TEST=%.4f" % (accuracy_train,accuracy_test))
print("LOG LOSS: "+str(log_loss(Y_test, y_pred_proba)))
ACCURACY: TRAIN=0.9688 TEST=0.8605 LOG LOSS: 0.21214434107942406
boston_house_with_CHAS = boston_house_cleaned[boston_house_cleaned['CHAS'].notna()]
boston_house_no_CHAS = boston_house_cleaned[boston_house_cleaned['CHAS'].isna()]
boston_house_no_CHAS = boston_house_no_CHAS.drop('CHAS', axis=1)
prediction = forest.predict(boston_house_no_CHAS.values)
boston_house_no_CHAS.insert(0, 'CHAS', prediction)
frames = [boston_house_with_CHAS, boston_house_no_CHAS]
boston_house_cleaned = pd.concat(frames)
boston_house_cleaned.isnull().sum()
CRIM 0 ZN 0 INDUS 0 CHAS 0 NOX 0 RM 0 AGE 0 DIS 0 RAD 0 TAX 0 PTRATIO 0 B 0 LSTAT 0 MEDV 0 dtype: int64
boston_house_cleaned
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296.0 | 15.3 | 396.900000 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.900000 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.830000 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 369.650773 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.900000 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 433 | 4.83567 | 0.0 | 18.10 | 0.0 | 0.583 | 5.905 | 53.2 | 3.1523 | 24 | 666.0 | 20.2 | 388.220000 | 11.45 | 20.6 |
| 434 | 0.15086 | 0.0 | 27.74 | 0.0 | 0.609 | 5.454 | 92.7 | 1.8209 | 4 | 711.0 | 20.1 | 395.090000 | 18.06 | 15.2 |
| 435 | 0.18337 | 0.0 | 27.74 | 0.0 | 0.609 | 5.414 | 98.3 | 1.7554 | 4 | 711.0 | 20.1 | 344.050000 | 23.97 | 7.0 |
| 436 | 0.20746 | 0.0 | 27.74 | 0.0 | 0.609 | 5.093 | 98.0 | 1.8226 | 4 | 711.0 | 20.1 | 318.430000 | 29.68 | 8.1 |
| 437 | 0.10574 | 0.0 | 27.74 | 0.0 | 0.609 | 5.983 | 98.8 | 1.8681 | 4 | 711.0 | 20.1 | 390.110000 | 18.07 | 13.6 |
452 rows × 14 columns
boston_house_cleaned.info()
<class 'pandas.core.frame.DataFrame'> Index: 452 entries, 0 to 437 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CRIM 452 non-null float64 1 ZN 452 non-null float64 2 INDUS 452 non-null float64 3 CHAS 452 non-null float64 4 NOX 452 non-null float64 5 RM 452 non-null float64 6 AGE 452 non-null float64 7 DIS 452 non-null float64 8 RAD 452 non-null int64 9 TAX 452 non-null float64 10 PTRATIO 452 non-null float64 11 B 452 non-null float64 12 LSTAT 452 non-null float64 13 MEDV 452 non-null float64 dtypes: float64(13), int64(1) memory usage: 53.0 KB
boston_house_cleaned.describe()
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 452.000000 | 452.000000 | 452.000000 | 452.000000 | 452.000000 | 452.000000 | 452.000000 | 452.000000 | 452.000000 | 452.000000 | 452.000000 | 452.000000 | 452.000000 | 452.000000 |
| mean | 1.420825 | 12.721239 | 10.304889 | 0.070796 | 0.540816 | 6.343538 | 65.557965 | 4.044987 | 7.823009 | 377.442478 | 18.247124 | 369.650773 | 11.441881 | 23.750442 |
| std | 2.495894 | 24.326032 | 6.797103 | 0.256769 | 0.113816 | 0.666808 | 28.127025 | 2.097174 | 7.543494 | 151.327573 | 2.200064 | 68.384000 | 6.156437 | 8.808602 |
| min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 6.300000 |
| 25% | 0.069875 | 0.000000 | 4.930000 | 0.000000 | 0.447000 | 5.926750 | 40.950000 | 2.341350 | 4.000000 | 276.750000 | 16.800000 | 376.925000 | 6.587500 | 18.500000 |
| 50% | 0.191030 | 0.000000 | 8.140000 | 0.000000 | 0.519000 | 6.229000 | 71.800000 | 3.539200 | 5.000000 | 307.000000 | 18.600000 | 391.955000 | 10.250000 | 21.950000 |
| 75% | 1.211460 | 20.000000 | 18.100000 | 0.000000 | 0.605000 | 6.635000 | 91.625000 | 5.404800 | 7.000000 | 411.000000 | 20.200000 | 396.060000 | 15.105000 | 26.600000 |
| max | 9.966540 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 34.410000 | 50.000000 |
Often, a DataFrame will contain columns that are not useful to your analysis. Such columns should be dropped from the DataFrame to make it easier for you to focus on the remaining columns.
If we have a lot of missing values in a column, we can just remove that column from our dataset. The only problem with this method is that we lose some information needed by our model to make accurate predictions.
We can also remove columns that have a low correlation with the target.
In this case, we can remove the CHAS and DIS features, as they have correlations of only about 13% and 14% with the target.
boston_house_cleaned.drop(['CHAS', 'DIS',], axis=1, inplace=True)
boston_house_cleaned.info()
<class 'pandas.core.frame.DataFrame'> Index: 452 entries, 0 to 437 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CRIM 452 non-null float64 1 ZN 452 non-null float64 2 INDUS 452 non-null float64 3 NOX 452 non-null float64 4 RM 452 non-null float64 5 AGE 452 non-null float64 6 RAD 452 non-null int64 7 TAX 452 non-null float64 8 PTRATIO 452 non-null float64 9 B 452 non-null float64 10 LSTAT 452 non-null float64 11 MEDV 452 non-null float64 dtypes: float64(11), int64(1) memory usage: 45.9 KB
Categorical data can be put into groups or categories using names or labels; such data is also known as qualitative data.
In this dataset CHAS and RAD are categorical features, both are nominal.
Data expressed in numerical terms rather than in natural-language descriptions is called numerical data. It can only be gathered in numerical form, hence its name. This data type, also referred to as quantitative data, can be used to measure a person's height, weight, IQ, etc.
Except for RAD and CHAS, all other features in this dataset are numerical.
Normalization means scaling the values of the features to a fixed range, for example [0, 1] or [-1, 1]. This method is useful when the data has no extreme outliers and lies in a fixed range. We can use the MinMaxScaler class to do this. Normalization matters most for algorithms that rely on distances between data points, such as KNN or neural networks. Below is the formula for min-max normalization: $$ X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} $$
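A minimal sketch of min-max scaling with scikit-learn on a toy single-feature column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [3.0], [5.0]])

# Maps the column minimum to 0 and the maximum to 1, linearly in between,
# i.e. (X - X.min()) / (X.max() - X.min()).
scaler = MinMaxScaler(feature_range=(0, 1))
X_norm = scaler.fit_transform(X)
print(X_norm.ravel())  # [0.  0.5 1. ]
```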
Standardization means scaling the values of the features to have a mean of 0 and a standard deviation of 1. This method is useful when we have features with different means and standard deviations. We can use the StandardScaler class to do this. Below is the formula for standardization: $$ X_{std} = \frac{X - \mu}{\sigma} $$ Standardization can be helpful in cases where the data follows a Gaussian distribution.
standardized = StandardScaler().fit_transform(boston_house_cleaned)
standardized = pd.DataFrame(standardized, columns=boston_house_cleaned.columns)
standardized.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 452 entries, 0 to 451 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CRIM 452 non-null float64 1 ZN 452 non-null float64 2 INDUS 452 non-null float64 3 NOX 452 non-null float64 4 RM 452 non-null float64 5 AGE 452 non-null float64 6 RAD 452 non-null float64 7 TAX 452 non-null float64 8 PTRATIO 452 non-null float64 9 B 452 non-null float64 10 LSTAT 452 non-null float64 11 MEDV 452 non-null float64 dtypes: float64(12) memory usage: 42.5 KB
standardized.describe()
| CRIM | ZN | INDUS | NOX | RM | AGE | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4.520000e+02 | 4.520000e+02 | 4.520000e+02 | 4.520000e+02 | 4.520000e+02 | 4.520000e+02 | 4.520000e+02 | 452.000000 | 4.520000e+02 | 4.520000e+02 | 4.520000e+02 | 4.520000e+02 |
| mean | -6.287989e-17 | 7.859986e-18 | 1.414797e-16 | 4.715992e-17 | 1.057168e-15 | 5.501990e-17 | 9.431983e-17 | 0.000000 | -1.477677e-15 | 6.287989e-16 | 2.947495e-16 | 4.165793e-16 |
| std | 1.001108e+00 | 1.001108e+00 | 1.001108e+00 | 1.001108e+00 | 1.001108e+00 | 1.001108e+00 | 1.001108e+00 | 1.001108 | 1.001108e+00 | 1.001108e+00 | 1.001108e+00 | 1.001108e+00 |
| min | -5.673607e-01 | -5.235270e-01 | -1.450000e+00 | -1.370533e+00 | -4.177548e+00 | -2.230147e+00 | -9.054914e-01 | -1.259873 | -2.569644e+00 | -5.406820e+00 | -1.579264e+00 | -1.983263e+00 |
| 25% | -5.418687e-01 | -5.235270e-01 | -7.916380e-01 | -8.251900e-01 | -6.257419e-01 | -8.758563e-01 | -5.073570e-01 | -0.666131 | -6.584934e-01 | 1.064911e-01 | -7.893786e-01 | -5.967190e-01 |
| 50% | -4.932732e-01 | -5.235270e-01 | -3.188547e-01 | -1.918880e-01 | -1.719604e-01 | 2.221690e-01 | -3.746456e-01 | -0.466012 | 1.605713e-01 | 3.265229e-01 | -1.938136e-01 | -2.046224e-01 |
| 75% | -8.397672e-02 | 2.995485e-01 | 1.148099e+00 | 5.645562e-01 | 4.375855e-01 | 9.277881e-01 | -1.092227e-01 | 0.222000 | 8.886288e-01 | 3.866181e-01 | 5.956657e-01 | 3.238556e-01 |
| max | 3.427703e+00 | 3.591850e+00 | 2.567922e+00 | 2.904255e+00 | 3.657970e+00 | 1.225874e+00 | 2.146872e+00 | 2.206651 | 1.707694e+00 | 3.989153e-01 | 3.734883e+00 | 2.983293e+00 |
There are many ways to encode the categorical features. Some of them are as follows:
Ordinal (label) encoding: this method is useful when the categories have an order. For example, if we have a feature with 3 categories, we can encode them as follows:
Category 1: 0
Category 2: 1
Category 3: 2
One-hot encoding: this method is useful when the categories don't have an order. It is the most useful method for algorithms that use the distance between data points, such as KNN. For example, if we have a feature with 3 categories, we can encode them as follows:
Category 1: 1, 0, 0
Category 2: 0, 1, 0
Category 3: 0, 0, 1
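The two schemes above can be sketched with pandas on a hypothetical column (the category names and their assumed order are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'size': ['low', 'mid', 'high', 'mid']})

# Ordinal encoding: integers that respect an assumed order low < mid < high.
order = ['low', 'mid', 'high']
df['size_ordinal'] = pd.Categorical(df['size'], categories=order,
                                    ordered=True).codes

# One-hot encoding: one binary indicator column per category.
one_hot = pd.get_dummies(df['size'], prefix='size')
print(df['size_ordinal'].tolist())  # [0, 1, 2, 1]
```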
Binary encoding: this method is useful when the categories don't have an order. It is somewhat similar to one-hot encoding but uses fewer columns. For example, if we have a feature with 3 categories, we can encode them as follows:
Category 1: 00
Category 2: 01
Category 3: 10
Frequency encoding: this method is useful when the categories don't have an order; each category is replaced by its relative frequency in the data. For example, if we have a feature with 3 categories, we can encode them as follows:
Category 1: 0.5
Category 2: 0.25
Category 3: 0.25
Target (mean) encoding: this method is useful when the categories don't have an order; each category is replaced by the mean of the target for that category. For example, if we have a feature with 3 categories, we can encode them as follows:
Category 1: 0.5
Category 2: 0.25
Category 3: 0.75
As no categorical features in this dataset are stored as strings or objects, we don't need to encode them.
A training data set is a data set of examples used during the learning process and is used to fit the parameters (e.g., weights) of, for example, a classifier.
A test data set is a data set that is independent of the training data set, but that follows the same probability distribution as the training data set. If a model fit to the training data set also fits the test data set well, minimal overfitting has taken place. A much better fit on the training set than on the test set usually points to overfitting.
A validation data set is a data set of examples used to tune the hyperparameters (i.e. the architecture) of a classifier. It is sometimes also called the development set or the "dev set". An example of a hyperparameter for artificial neural networks includes the number of hidden units in each layer. It, as well as the testing set (as mentioned below), should follow the same probability distribution as the training data set. The validation set is a set of data that is used to validate model performance during training.
Some common percentages for splitting the dataset into train and test sets are as follows:
There are some ways to split the dataset into train and test sets. Some of them are as follows:
Random split: this is the most common method, but it has a drawback: if we split the dataset randomly, the train and test sets may not have the same distribution.
Sequential split: this method is useful when we have a time-series dataset, which is not the case here.
Stratified split: this method is useful when we have an imbalanced dataset.
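For reference, a stratified split is a one-argument change in scikit-learn; the imbalanced toy labels below are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)  # imbalanced 9:1 labels

# stratify=y preserves the 9:1 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)
print(np.bincount(y_te))  # [18  2] - still 9:1
```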
Here we use the random split.
X_train, X_test, y_train, y_test = train_test_split(standardized.drop('MEDV', axis=1),
standardized['MEDV'],
test_size=0.2,
random_state=42)
In this section, we will train, test and evaluate the models. We will use the following models:
In the predictive or supervised learning approach, the goal is to learn a mapping from inputs x to outputs y, given a labeled set of input-output pairs $$D = \left \{ (x_i , y_i) \right \}_{i=1}^{N} $$. Here D is called the training set, and N is the number of training examples.
The second main type of machine learning is the descriptive or unsupervised learning approach. Here we are only given inputs, $$D = \left \{x_i \right \}_{i=1}^{N}$$, and the goal is to find "interesting patterns" in the data. This is sometimes called knowledge discovery. This is a much less well-defined problem, since we are not told what kinds of patterns to look for, and there is no obvious error metric to use (unlike supervised learning, where we can compare our prediction of y for a given x to the observed value).
When dealing with missing data, it is helpful to distinguish the cases when there is missingness only at test time (so the training data is complete data), from the harder case when there is missingness also at training time. We will discuss these two cases below. Note that the class label is always missing at test time, by definition; if the class label is also sometimes missing at training time, the problem is called semi-supervised learning.
Reinforcement Learning (RL) is the science of decision making. It is about learning the optimal behavior in an environment to obtain maximum reward. In RL, the data is accumulated from machine learning systems that use a trial-and-error method. Data is not part of the input that we would find in supervised or unsupervised machine learning.
Similarly, the form of the output or response variable can in principle be anything, but most methods assume that $y_i$ is a categorical or nominal variable from some finite set, $$y_i \in \left \{1, \dots, C\right \}$$ (such as male or female), or that $y_i$ is a real-valued scalar (such as income level). When $y_i$ is categorical, the problem is known as classification or pattern recognition, and when $y_i$ is real-valued, the problem is known as regression. Another variant, known as ordinal regression, occurs where the label space Y has some natural ordering, such as grades A–F.
Main form of the simple linear regression function: $$f(x) = \alpha x + \beta$$ Here we want to find the slope ($\alpha$) and the bias ($\beta$) by setting the derivatives of the Residual Sum of Squares (RSS) function to zero: - step 1: Compute the RSS of the training data $$ RSS = \Sigma (y_i - (\hat{\beta} + \hat{\alpha} x_i) )^2 $$ - step 2: Compute the derivatives of the RSS function with respect to $\alpha$ and $\beta$, and set them equal to 0 to find the desired parameters $$ \frac{\partial RSS}{\partial \beta} = \Sigma -2(y_i - \hat{\beta} - \hat{\alpha} x_i) = 0 \to \hat{\beta} = \bar{y} - \hat{\alpha} \bar{x} \to (1)$$ $$ \frac{\partial RSS}{\partial \alpha} = \Sigma -2 x_i (y_i - \hat{\beta} - \hat{\alpha} x_i) = 0 \to (2)$$ $$ (1), (2) \to \hat{\alpha} = \frac{\Sigma{(x_i - \bar{x})(y_i - \bar{y})}}{\Sigma{(x_i - \bar{x})^2}}, \qquad \hat{\beta} = \bar{y} - \hat{\alpha} \bar{x}$$ Based on the above formulas, implement the class below to compute the parameters of a simple linear regression.
class LinearRegression:
def __init__(self, n_samples):
self._n_samples = n_samples
self._slope = None
self._bias = None
@property
def slope(self):
return self._slope
@property
def bias(self):
return self._bias
def train(self, X, Y):
mean_X = np.mean(X)
mean_Y = np.mean(Y)
        # SS_XY = sum((x - mean_x)(y - mean_y)), SS_XX = sum((x - mean_x)^2)
        SS_XY = np.sum(X * Y) - self._n_samples * mean_X * mean_Y
        SS_XX = np.sum(X * X) - self._n_samples * mean_X * mean_X
        self._slope = SS_XY / SS_XX                  # alpha-hat
        self._bias = mean_Y - self._slope * mean_X   # beta-hat

    def predict(self, X):
        return self._bias + self._slope * X
Because our regression function has only one input variable, we must choose a single feature to predict MEDV, so we select the feature with the strongest correlation with MEDV.
model = LinearRegression(X_train.shape[0])
The code below draws the regression plot.
def draw_regression_plot(X_test, y_test, y_predict, xlabel, ylabel):
plt.figure(figsize=(10,6))
plt.scatter(X_test, y_test, color='blue', label='y_test')
plt.scatter(X_test, y_predict, color='red', label='y_predict')
    max_val = max(max(y_test), max(y_predict))
    min_val = min(min(y_test), min(y_predict))
    plt.plot([min_val, max_val], [min_val, max_val], 'b--', label='x = y')
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.legend()
plt.show()
The RSS measures the amount of error remaining between the regression function and the data set after the model has been run. A smaller RSS figure represents a regression function that is well-fit to the data. $$ RSS = \Sigma_{i=1}^{n} (y_i - \hat{y_i})^2 $$
Mean Squared Error (MSE) is another metric used to evaluate the performance of regression models. Unlike R-Squared and Adjusted R-Squared, which are measures of explained variance, MSE is a measure of prediction error. Specifically, it quantifies the average squared difference between the actual and predicted values. $$ MSE = \frac{1}{n}\Sigma_{i=1}^{n}(y_i - \hat{y_i})^2 $$
The Root Mean Squared Error (RMSE) is one of the two main performance indicators for a regression model. It measures the average difference between values predicted by a model and the actual values. It provides an estimation of how well the model is able to predict the target value (accuracy). $$ RMSE = \sqrt{\frac{\Sigma_{i=1}^{n}(y_i - \hat{y_i})^2}{n}} $$
R-Squared (R²) is a statistical measure of the proportion of variance in the dependent variable that can be predicted or explained by the independent variable. R-Squared values typically range from 0 to 1: a value of 1 indicates that the model explains 100% of the variance, a value of 0.5 indicates that it explains 50%, and so on.
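To make the four metrics concrete, here is a minimal sketch computing them by hand on a tiny set of made-up values (`y_true` and `y_hat` are purely illustrative); `sklearn.metrics.mean_squared_error` and `r2_score` return the same MSE and R² figures:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # made-up target values
y_hat = np.array([2.8, 5.3, 2.9, 6.8])    # made-up predictions

rss = np.sum((y_true - y_hat) ** 2)            # Residual Sum of Squares
mse = rss / len(y_true)                        # Mean Squared Error
rmse = np.sqrt(mse)                            # Root Mean Squared Error
tss = np.sum((y_true - y_true.mean()) ** 2)    # Total Sum of Squares
r2 = 1 - rss / tss                             # R-Squared = 1 - RSS/TSS

print(rss, mse, rmse, r2)
```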
model.train(X_train['RM'].values, y_train.values)
model.slope
0.7575486284708407
model.bias
0.00918539734800285
y_pred = model.predict(X_test['RM'].values)
r2_score(y_test.values, y_pred)
0.3215543611776813
mean_squared_error(y_test.values, y_pred, squared=True)
0.4877323883782655
draw_regression_plot(X_test=X_test['RM'].values, y_test=y_test.values,
y_predict=y_pred, xlabel='RM', ylabel='MEDV')
model.train(X_train['LSTAT'].values, y_train.values)
model.slope
-0.72714820201285
model.bias
0.022009976588282402
y_pred = model.predict(X_test['LSTAT'].values)
r2_score(y_test.values, y_pred)
0.44427364365873356
mean_squared_error(y_test.values, y_pred, squared=True)
0.3995098907755855
draw_regression_plot(X_test=X_test['LSTAT'].values, y_test=y_test.values,
y_predict=y_pred, xlabel='LSTAT', ylabel='MEDV')
model.train(X_train['PTRATIO'].values, y_train.values)
model.slope
-0.4845873537606794
model.bias
0.002133682679928143
y_pred = model.predict(X_test['PTRATIO'].values)
r2_score(y_test.values, y_pred)
0.4454465921086421
mean_squared_error(y_test.values, y_pred, squared=True)
0.3986666619062667
draw_regression_plot(X_test=X_test['PTRATIO'].values, y_test=y_test.values,
y_predict=y_pred, xlabel='PTRATIO', ylabel='MEDV')
Before starting, we add a new column named CMEDV (categorical MEDV), which is 1 (luxury) for values above the median and 0 (normal) for values at or below the median
median_MEDV = standardized['MEDV'].quantile(0.50)
standardized['CMEDV'] = standardized['MEDV'].apply(lambda x : 1 if x > median_MEDV else 0)
standardized['CMEDV'].astype('category').describe()
count     452
unique      2
top         0
freq      226
Name: CMEDV, dtype: int64
Now, let's divide our new train test columns
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(
standardized.drop(['MEDV', 'CMEDV'], axis=1), standardized['CMEDV'],
test_size=0.2, random_state=42)
A decision tree is a classification algorithm that classifies a data point by applying a sequence of decision rules derived from the features and their values
Decision tree pruning is a technique used to prevent decision trees from overfitting the training data. Pruning aims to simplify the decision tree by removing parts of it that do not provide significant predictive power, thus improving its ability to generalize to new data.
Pre-pruning involves setting a maximum tree depth or minimum number of samples per leaf before the tree is constructed. This can help prevent the tree from becoming too complex and overfitting to the training data.
pros: computationally cheap, since the tree never grows beyond the chosen limits, and it yields smaller, more interpretable trees.
cons: the stopping criteria may halt growth too early (the "horizon effect"), discarding splits that would have become useful later and causing underfitting.
Post-pruning involves trimming the tree after it has already been constructed. One technique for post-pruning is called reduced error pruning, where nodes are removed from the tree if the accuracy of the model does not improve after the split.
pros: it evaluates the fully grown tree, so useful splits are rarely discarded and generalization is usually better than with pre-pruning.
cons: computationally more expensive, since the full tree must be built before it can be trimmed.
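Note that scikit-learn's `DecisionTreeClassifier` implements post-pruning via minimal cost-complexity pruning (the `ccp_alpha` parameter) rather than reduced error pruning. A minimal sketch on synthetic data (the `ccp_alpha=0.02` value is an arbitrary choice for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# An unpruned tree vs. one post-pruned with a cost-complexity penalty
full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

# Pruning can only remove nodes, never add them
print(full.tree_.node_count, pruned.tree_.node_count)
```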
Decision trees are a good choice when the relationship between the predictors and the response is complex and non-linear, when interpretability is important, when dealing with missing values or outliers, when handling large and high-dimensional data, and when the target variable has more than two classes.
dtree_clf = DecisionTreeClassifier(random_state=42)
dtree_clf.fit(X_train_clf.values, y_train_clf.values)
y_pred_clf = dtree_clf.predict(X_test_clf.values)
dtree_clf_report = classification_report(y_test_clf, y_pred_clf, output_dict=True)
pd.DataFrame(dtree_clf_report).transpose()
| precision | recall | f1-score | support | |
|---|---|---|---|---|
| 0 | 0.923077 | 0.734694 | 0.818182 | 49.000000 |
| 1 | 0.750000 | 0.928571 | 0.829787 | 42.000000 |
| accuracy | 0.824176 | 0.824176 | 0.824176 | 0.824176 |
| macro avg | 0.836538 | 0.831633 | 0.823985 | 91.000000 |
| weighted avg | 0.843195 | 0.824176 | 0.823538 | 91.000000 |
plt.figure(figsize=(20,10))
plot_tree(dtree_clf, feature_names=X_train_clf.columns,
class_names=['normal', 'lux'], filled=True)
plt.show()
KNN is a classification algorithm which is based on distance. It classifies a data point based on the majority of the classes of the k nearest neighbors.
The major difference between KNN and other classification algorithms like SVM (Support Vector Machines), Random Forests, Logistic Regression, etc., lies in the fact that KNN does not build an explicit model during training (it is a "lazy" learner), while the other algorithms do.
The 1-NN classifier is one of the oldest methods known. The idea is extremely simple: to classify X, find its closest neighbor among the training points (call it X') and assign to X the label of X'.
pros: simple to implement, has no training phase, and naturally adapts to complex decision boundaries.
cons: sensitive to noisy labels and irrelevant features, expensive at prediction time, and its performance degrades in high dimensions.
Jaccard Distance: The Jaccard coefficient compares two binary data sets by counting the positions where both values equal 1. The resulting value reflects how many 1-to-1 matches occur relative to the total number of positions where either vector has a 1 (the union). Like Cosine Similarity, it measures how frequently attributes co-occur. $$ J(A, B) = \frac{\left | A \cap B \right |}{\left | A \cup B \right |} $$
Hamming Distance: Hamming distance is a metric for comparing two binary data strings. When comparing two binary strings of equal length, the Hamming distance is the number of bit positions in which the two bits differ; it tells you how many attributes are different between the two data points.
This is used mostly when you one-hot encode your data and need to find distances between the two binary vectors.
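Both distances are available in `scipy.spatial.distance`; a small sketch on two made-up binary vectors:

```python
import numpy as np
from scipy.spatial.distance import hamming, jaccard

a = np.array([1, 0, 1, 1, 0, 1])
b = np.array([1, 1, 0, 1, 0, 1])

# Hamming: fraction of positions that differ (2 of 6 here)
print(hamming(a, b))

# Jaccard distance: 1 - |A ∩ B| / |A ∪ B| over the 1-valued positions;
# here the intersection has 3 ones and the union has 5, so 1 - 3/5 = 0.4
print(jaccard(a, b))
```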
knn = KNeighborsClassifier(n_neighbors=70)
knn.fit(X_train_clf.values, y_train_clf.values)
y_pred_clf = knn.predict(X_test_clf.values)
knn_report = classification_report(y_test_clf, y_pred_clf, output_dict=True)
pd.DataFrame(knn_report).transpose()
| precision | recall | f1-score | support | |
|---|---|---|---|---|
| 0 | 0.923077 | 0.734694 | 0.818182 | 49.000000 |
| 1 | 0.750000 | 0.928571 | 0.829787 | 42.000000 |
| accuracy | 0.824176 | 0.824176 | 0.824176 | 0.824176 |
| macro avg | 0.836538 | 0.831633 | 0.823985 | 91.000000 |
| weighted avg | 0.843195 | 0.824176 | 0.823538 | 91.000000 |
GridSearchCV acts as a valuable tool for identifying the optimal parameters for a machine learning model. Here’s the process: you provide GridSearchCV with a grid of hyperparameter values to experiment with, and it systematically explores each possible combination. For every combination, it evaluates the model’s performance by testing it on various sections of the dataset (cross-validation folds) to gauge its accuracy.
After exhaustively trying out all the combinations, GridSearchCV presents you with the combination of settings that yielded the most favorable outcomes. This streamlines the process of fine-tuning your model, ensuring it operates optimally for your specific task without incurring excessive computational expenses.
criterion: The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain.
splitter: The strategy used to choose the split at each node. Supported strategies are "best" to choose the best split and "random" to choose the best random split.
max_depth: The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_split: The minimum number of samples required to split an internal node.
min_samples_leaf: The minimum number of samples required to be at a leaf node.
grid = {'criterion': ['gini', 'entropy'], 'splitter': ['best', 'random'],
'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, None],
'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9, 10],
'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'random_state': [54]}
tree_grid_search = GridSearchCV(estimator=DecisionTreeClassifier(),
param_grid=grid, scoring='accuracy', cv=10)
tree_grid_search.fit(X_train_clf.values, y_train_clf.values)
print(f'Best Accuracy: {tree_grid_search.best_score_ * 100:5.2f}%')
print(f'Best Parameters: {tree_grid_search.best_params_}')
print(f'Decision Tree Accuracy: {tree_grid_search.score(X_test_clf.values, y_test_clf.values) * 100:5.2f}%')
Best Accuracy: 88.09%
Best Parameters: {'criterion': 'entropy', 'max_depth': 6, 'min_samples_leaf': 5, 'min_samples_split': 2, 'random_state': 54, 'splitter': 'best'}
Decision Tree Accuracy: 87.91%
n_neighbors: Number of neighbors to use by default for kneighbors queries.
weights: Weight function used in prediction. Possible values:
uniform: uniform weights. All points in each neighborhood are weighted equally.
distance: weight points by the inverse of their distance. in this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
metric: The distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric.
algorithm: Algorithm used to compute the nearest neighbors.
leaf_size: Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
p: Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
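A quick illustration of the `p` parameter using `scipy.spatial.distance.minkowski` on a made-up 3-4-5 right triangle: `p=2` recovers the Euclidean distance and `p=1` the Manhattan distance.

```python
import numpy as np
from scipy.spatial.distance import minkowski

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(minkowski(a, b, p=2))  # Euclidean: sqrt(3^2 + 4^2) = 5.0
print(minkowski(a, b, p=1))  # Manhattan: |3| + |4| = 7.0
```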
grid = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 25, 30, 35, 40, 45, 50, 70],
'weights': ['uniform', 'distance'],
'metric': ['euclidean', 'manhattan', 'minkowski'],
'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
'leaf_size': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
'p': [1, 2]}
knn_grid_search = GridSearchCV(estimator=KNeighborsClassifier(),
param_grid=grid, scoring='accuracy', cv=10)
knn_grid_search.fit(X_train_clf.values, y_train_clf.values)
print(f'Best Accuracy: {knn_grid_search.best_score_ * 100:5.2f}%')
print(f'Best Parameters: {knn_grid_search.best_params_}')
print(f'KNN Accuracy: {knn_grid_search.score(X_test_clf.values, y_test_clf.values) * 100:5.2f}%')
Best Accuracy: 88.08%
Best Parameters: {'algorithm': 'auto', 'leaf_size': 10, 'metric': 'euclidean', 'n_neighbors': 12, 'p': 1, 'weights': 'uniform'}
KNN Accuracy: 83.52%
The models used are neither overfitting nor underfitting: they are not too complex or too simple, which is why the training accuracy and the test accuracy are close to each other and both high enough. We can prevent overfitting and underfitting by using a validation set and cross-validation; GridSearchCV uses cross-validation for exactly this purpose.
Ensemble methods are powerful techniques in machine learning that combine multiple models to improve overall performance. These methods leverage the strengths of individual models, often resulting in better predictive performance and robustness compared to single models
Bagging: Bagging involves training multiple instances of the same model on different subsets of the training data, obtained by random sampling with replacement (bootstrap samples). The predictions of these models are then averaged (regression) or voted upon (classification).
Boosting: Boosting sequentially trains models, with each new model focusing on the errors made by the previous ones. The predictions are combined with a weighted vote.
| Aspect | Bagging | Boosting |
|---|---|---|
| Model Building | Parallel | Sequential |
| Model Dependency | Independent | Dependent |
| Data Sampling | Bootstrap sampling (with replacement) | Adaptive sampling (weighted instances) |
| Primary Focus | Reducing variance (overfitting) | Reducing bias and variance |
| Aggregation Method | Averaging or majority vote | Weighted combination |
| Typical Use Cases | High-variance models (e.g., decision trees) | Cases requiring strong models with reduced bias and variance |
Random forest is a commonly-used machine learning algorithm, trademarked by Leo Breiman and Adele Cutler, that combines the output of multiple decision trees to reach a single result. Its ease of use and flexibility have fueled its adoption, as it handles both classification and regression problems.
Bootstrapping involves creating multiple random subsets of the original dataset by sampling with replacement. This means that each subset can contain duplicate instances from the original dataset. The size of each subset is typically the same as the original dataset.
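A quick numerical illustration (not part of the notebook): a bootstrap sample of size n drawn with replacement contains, on average, about 1 − 1/e ≈ 63.2% of the distinct original observations; the remaining ~36.8% are the "out-of-bag" points that Random Forests can use for validation.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Bootstrap: n draws with replacement from indices 0..n-1
sample = rng.integers(0, n, size=n)

# Fraction of distinct original observations that made it into the sample
unique_frac = np.unique(sample).size / n
print(unique_frac)  # close to 1 - 1/e ≈ 0.632
```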
Bootstrapping plays a crucial role in the Random Forest algorithm, and it significantly impacts the performance and characteristics of the model in several ways:
If the number of observations is large but the number of trees is too small, then some observations will be predicted only once or not at all. If the number of predictors is large but the number of trees is too small, then some features can (theoretically) be missed in all the feature subspaces used.
Improving Accuracy: In general, increasing the number of trees tends to improve the accuracy of the Random Forest up to a certain point. With more trees, the model can capture more complex patterns and reduce variance.
Stabilizing Predictions: Adding more trees can also lead to more stable predictions, as the ensemble average or majority vote becomes more robust with a larger number of trees.
Computational Cost: However, there's a trade-off with computational cost. Training and making predictions with a larger number of trees can be more computationally expensive and may not always lead to a significant improvement in accuracy.
Empirical studies and practical experiences often suggest that the optimal number of trees lies between 100 and 500 for most datasets. However, for more complex problems or larger datasets, the optimal number might be higher.
Use Random Forest when you need a robust, accurate model that can handle high-dimensional data, non-linear relationships, and is tolerant to noise and outliers. Avoid it when real-time predictions, interpretability, or memory efficiency are critical requirements, or when dealing with very high dimensional sparse data where simpler models might be more effective.
Random Forests reduce variance through:
By reducing variance, Random Forests are able to create models that generalize better to new, unseen data, thus avoiding overfitting and improving overall predictive performance.
The parameters are explained in the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
rf_clf = RandomForestClassifier(random_state=42)
grid = {
'criterion': ['gini', 'entropy'],
'max_depth': [5, 6, 7, 8, 9, 10, None],
'min_samples_split': [2, 3, 4, 5, 6],
'min_samples_leaf': [1, 2, 3, 4],
}
rf_grid_search = GridSearchCV(estimator=rf_clf, param_grid=grid, cv=10, scoring='accuracy')
rf_grid_search.fit(X_train_clf.values, y_train_clf.values)
print(f'Best Accuracy: {rf_grid_search.best_score_ * 100:5.2f}%')
print(f'Best Parameters: {rf_grid_search.best_params_}')
print(f'Random Forest Accuracy: {rf_grid_search.score(X_test_clf.values, y_test_clf.values) * 100:5.2f}%')
Best Accuracy: 90.86%
Best Parameters: {'criterion': 'entropy', 'max_depth': 8, 'min_samples_leaf': 1, 'min_samples_split': 3}
Random Forest Accuracy: 90.11%
def plot_confusion_matrix(cm, title):
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title(title)
plt.show()
svm_rbf = SVC(kernel='rbf', random_state=42)
svm_rbf.fit(X_train_clf.values, y_train_clf.values)
SVC(random_state=42)
y_pred_rbf = svm_rbf.predict(X_test_clf.values)
print(f"Accuracy: {accuracy_score(y_test_clf.values, y_pred_rbf)}")
print(classification_report(y_test_clf.values, y_pred_rbf))
Accuracy: 0.8571428571428571
precision recall f1-score support
0 0.89 0.84 0.86 49
1 0.82 0.88 0.85 42
accuracy 0.86 91
macro avg 0.86 0.86 0.86 91
weighted avg 0.86 0.86 0.86 91
cm_rbf = confusion_matrix(y_test_clf.values, y_pred_rbf)
plot_confusion_matrix(cm_rbf, "Confusion Matrix for RBF Kernel SVM")
svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train_clf.values, y_train_clf.values)
SVC(kernel='linear', random_state=42)
y_pred_linear = svm_linear.predict(X_test_clf.values)
print(f"Accuracy: {accuracy_score(y_test_clf.values, y_pred_linear)}")
print(classification_report(y_test_clf.values, y_pred_linear))
Accuracy: 0.8571428571428571
precision recall f1-score support
0 0.89 0.84 0.86 49
1 0.82 0.88 0.85 42
accuracy 0.86 91
macro avg 0.86 0.86 0.86 91
weighted avg 0.86 0.86 0.86 91
cm_linear = confusion_matrix(y_test_clf.values, y_pred_linear)
plot_confusion_matrix(cm_linear, "Confusion Matrix for Linear Kernel SVM")
grid = {
'C': [0.1, 1, 10, 100],
'gamma': [1, 0.1, 0.01, 0.001],
'kernel': ['rbf', 'linear']
}
svm_grid_search = GridSearchCV(SVC(random_state=42), grid,
refit=True,
cv=10)
svm_grid_search.fit(X_train_clf.values, y_train_clf.values)
print(f"Best Parameters: {svm_grid_search.best_params_}")
Best Parameters: {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
Use when you can afford the computational cost and need an exhaustive search over a well-defined grid of parameters.
param_dist = {
'C': expon(scale=100),
'gamma': expon(scale=0.1),
'kernel': ['rbf', 'linear']
}
svm_random_search = RandomizedSearchCV(SVC(random_state=42),
param_distributions=param_dist,
n_iter=100,
refit=True,)
svm_random_search.fit(X_train_clf.values, y_train_clf.values)
print(f"Best Parameters: {svm_random_search.best_params_}")
Best Parameters: {'C': 3.707589875069221, 'gamma': 0.15719133343963326, 'kernel': 'rbf'}
Use when you have many hyperparameters to tune, want a faster solution, and are okay with a probabilistic approach that might miss the absolute best but often finds very good solutions.
RandomizedSearchCV is often preferred for its efficiency and ability to handle large and complex hyperparameter spaces effectively.
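To see what RandomizedSearchCV is actually sampling from, here is a small sketch drawing candidate `C` values from the `expon(scale=100)` distribution used above; each search iteration draws one such value rather than stepping through a fixed grid.

```python
from scipy.stats import expon

# Continuous distributions replacing the fixed grid of GridSearchCV
c_dist = expon(scale=100)      # candidate C values, mean 100
gamma_dist = expon(scale=0.1)  # candidate gamma values, mean 0.1

# Five example draws for C (always positive, mostly below a few hundred)
samples = c_dist.rvs(size=5, random_state=0)
print(samples)
```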